DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Briefly describe the data
  • Anticipate the End Result
    • Example: find current and future data dimensions
    • Challenge: Describe the final dimensions
  • Pivot the Data
    • Example
    • Challenge: Pivot the Chosen Data

Challenge 3

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Author

Mariia Dubyk

Published

November 1, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Code
library(readxl)
marriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
                       sheet = "Table 2",
                       col_names = c("Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Delete", "Delete", "Delete", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage"))

marriage <- slice(marriage, 8:180)
marriage <- select(marriage, "Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage")

Briefly describe the data

The dataset shows responses to the question in Australian Marriage Law Postal Survey 2017. We can see data gathered in different divisions of States and Territories in Australia. The responses are “Yes”, “No”, “Not clear” and “No response”. The table also shows percentage of each answer and total numbers.

To tidy data I will leave only divisions in the first column. I am going to move State and Territories to another column by adding a variable (as it was shown during the class).I will also remove percentages and total numbers from the table.

Code
marriage <- marriage%>%
  mutate(StateTerritory = case_when(str_ends(Division, "Divisions") ~ Division), NA_character_)%>%
  fill(StateTerritory, .direction = "down")
marriage <- filter(marriage, !str_detect(Division, "Division|Australia"))
marriage <- select(marriage, "Division", "Yes", "No", "Response_not_clear", "Non_responding", "StateTerritory")
marriage
# A tibble: 155 × 6
   Division  Yes   No    Response_not_clear Non_responding StateTerritory       
   <chr>     <chr> <chr> <chr>              <chr>          <chr>                
 1 Banks     37736 46343 247                20928          New South Wales Divi…
 2 Barton    37153 47984 226                24008          New South Wales Divi…
 3 Bennelong 42943 43215 244                19973          New South Wales Divi…
 4 Berowra   48471 40369 212                16038          New South Wales Divi…
 5 Blaxland  20406 57926 220                25883          New South Wales Divi…
 6 Bradfield 53681 34927 202                17261          New South Wales Divi…
 7 Calare    54091 35779 285                25342          New South Wales Divi…
 8 Chifley   32871 46702 263                28180          New South Wales Divi…
 9 Cook      47505 38804 229                18713          New South Wales Divi…
10 Cowper    57493 38317 315                25197          New South Wales Divi…
# … with 145 more rows

Anticipate the End Result

In the final dataframe we will have 4 rows (with answers “Yes”, “No”,“Response_not_clear”, “Non_responding”) for each division.

Example: find current and future data dimensions

Lets see if this works with a simple example.

Code
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df
# A tibble: 6 × 5
  country  year trade outgoing incoming
  <chr>   <dbl> <chr>    <dbl>    <dbl>
1 Mexico   1980 NAFTA     363.     847.
2 USA      1990 NAFTA    1299.     367.
3 France   1980 EU        958.     761.
4 Mexico   1990 NAFTA    1825.     635.
5 USA      1980 NAFTA     882.    1407.
6 France   1990 EU       1443.    1994.
Code
#existing rows/cases
nrow(df)
[1] 6
Code
#existing columns/cases
ncol(df)
[1] 5
Code
#expected rows/cases
nrow(df) * (ncol(df)-3)
[1] 12
Code
# expected columns 
3 + 2
[1] 5

Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.

Challenge: Describe the final dimensions

Any additional comments?

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code
df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df
# A tibble: 12 × 5
   country  year trade trade_direction trade_value
   <chr>   <dbl> <chr> <chr>                 <dbl>
 1 Mexico   1980 NAFTA outgoing               363.
 2 Mexico   1980 NAFTA incoming               847.
 3 USA      1990 NAFTA outgoing              1299.
 4 USA      1990 NAFTA incoming               367.
 5 France   1980 EU    outgoing               958.
 6 France   1980 EU    incoming               761.
 7 Mexico   1990 NAFTA outgoing              1825.
 8 Mexico   1990 NAFTA incoming               635.
 9 USA      1980 NAFTA outgoing               882.
10 USA      1980 NAFTA incoming              1407.
11 France   1990 EU    outgoing              1443.
12 France   1990 EU    incoming              1994.

Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!

Challenge: Pivot the Chosen Data

After pivoting the case is number of certain response in a division. I think it may be easier to visualize the data using this type of table. So we may say that data is tidy because it is easier to make next steps in analyzing data.

Code
marriage_new<-pivot_longer(marriage, col = c(Yes, No, Response_not_clear, Non_responding),
                 names_to="Response_type",
                 values_to = "Number")
marriage_new
# A tibble: 620 × 4
   Division  StateTerritory            Response_type      Number
   <chr>     <chr>                     <chr>              <chr> 
 1 Banks     New South Wales Divisions Yes                37736 
 2 Banks     New South Wales Divisions No                 46343 
 3 Banks     New South Wales Divisions Response_not_clear 247   
 4 Banks     New South Wales Divisions Non_responding     20928 
 5 Barton    New South Wales Divisions Yes                37153 
 6 Barton    New South Wales Divisions No                 47984 
 7 Barton    New South Wales Divisions Response_not_clear 226   
 8 Barton    New South Wales Divisions Non_responding     24008 
 9 Bennelong New South Wales Divisions Yes                42943 
10 Bennelong New South Wales Divisions No                 43215 
# … with 610 more rows
Source Code
---
title: "Challenge 3"
author: "Mariia Dubyk"
desription: "Tidy Data: Pivoting"
date: "11/01/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

```{r}
library(readxl)
marriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
                       sheet = "Table 2",
                       col_names = c("Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Delete", "Delete", "Delete", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage"))

marriage <- slice(marriage, 8:180)
marriage <- select(marriage, "Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage")

```

### Briefly describe the data

The dataset shows responses to the question in Australian Marriage Law Postal Survey 2017. We can see data gathered in different divisions of States and Territories in Australia. The responses are "Yes", "No", "Not clear" and "No response". The table also shows percentage of each answer and total numbers.

To tidy data I will leave only divisions in the first column. I am going to move State and Territories to another column by adding a variable (as it was shown during the class).I will also remove percentages and total numbers from the table.

```{r}
marriage <- marriage%>%
  mutate(StateTerritory = case_when(str_ends(Division, "Divisions") ~ Division), NA_character_)%>%
  fill(StateTerritory, .direction = "down")
marriage <- filter(marriage, !str_detect(Division, "Division|Australia"))
marriage <- select(marriage, "Division", "Yes", "No", "Response_not_clear", "Non_responding", "StateTerritory")
marriage
```


## Anticipate the End Result

In the final dataframe we will have 4 rows (with answers "Yes", "No","Response_not_clear", "Non_responding") for each division. 


### Example: find current and future data dimensions

Lets see if this works with a simple example.

```{r}
#| tbl-cap: Example

df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df

#existing rows/cases
nrow(df)

#existing columns/cases
ncol(df)

#expected rows/cases
nrow(df) * (ncol(df)-3)

# expected columns 
3 + 2
```

Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.

### Challenge: Describe the final dimensions


```{r}
```

Any additional comments?

## Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.

### Example

```{r}
#| tbl-cap: Pivoted Example

df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df
```

Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!

### Challenge: Pivot the Chosen Data

After pivoting the case is number of certain response in a division. I think it may be easier to visualize the data using this type of table. So we may say that data is tidy because it is easier to make next steps in analyzing data. 

```{r}
marriage_new<-pivot_longer(marriage, col = c(Yes, No, Response_not_clear, Non_responding),
                 names_to="Response_type",
                 values_to = "Number")
marriage_new
```